Data Annotation


From LLM-annotation to LLM-orchestrator: Coordinating Small Models for Data Labeling

Lu, Yao, Ji, Zhaiyuan, Du, Jiawei, Yu, Shanqing, Xuan, Qi, Zhou, Tianyi

arXiv.org Artificial Intelligence

Although the annotation paradigm based on Large Language Models (LLMs) has made significant breakthroughs in recent years, its practical deployment faces two core bottlenecks: first, calling commercial APIs for large-scale annotation is very expensive; second, in scenarios that require fine-grained semantic understanding, such as sentiment classification and toxicity classification, the annotation accuracy of LLMs is even lower than that of Small Language Models (SLMs) dedicated to the domain. To address these problems, we propose a new paradigm of multi-model cooperative annotation and design AutoAnnotator, a fully automatic annotation framework built on it. Specifically, AutoAnnotator consists of two layers. The upper meta-controller layer uses the generation and reasoning capabilities of LLMs to select SLMs for annotation, automatically generate annotation code, and verify difficult samples; the lower task-specialist layer consists of multiple SLMs that perform annotation through multi-model voting. In addition, we use the difficult samples flagged by the meta-controller layer's secondary review as a reinforcement learning set and fine-tune the SLMs in stages through a continual learning strategy, thereby improving their generalization. Extensive experiments show that AutoAnnotator outperforms existing open-source/API LLMs under zero-shot, one-shot, CoT, and majority-voting settings. Notably, AutoAnnotator reduces annotation cost by 74.15% compared to annotating directly with GPT-3.5-turbo, while still improving accuracy by 6.21%. Project page: https://github.com/Zhaiyuan-Ji/AutoAnnotator.
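The abstract describes the architecture only at a high level; as a minimal sketch of the two-layer idea, assuming hypothetical `slm_annotators` and `llm_review` callables (not the actual AutoAnnotator API, which also generates annotation code and fine-tunes the SLMs), the task-specialist layer could label by majority vote and escalate low-agreement samples to the LLM meta-controller:

```python
# Illustrative sketch only: majority voting across SLMs with LLM escalation.
# `slm_annotators` and `llm_review` are hypothetical stand-ins, not the real
# AutoAnnotator interfaces.
from collections import Counter
from typing import Callable, Iterable

def annotate(texts: Iterable[str],
             slm_annotators: list[Callable[[str], str]],
             llm_review: Callable[[str], str],
             min_agreement: float = 0.75) -> list[dict]:
    results = []
    for text in texts:
        votes = Counter(slm(text) for slm in slm_annotators)
        label, count = votes.most_common(1)[0]
        if count / len(slm_annotators) >= min_agreement:
            results.append({"text": text, "label": label, "source": "slm_vote"})
        else:
            # Difficult sample: defer to the LLM meta-controller for review.
            results.append({"text": text, "label": llm_review(text), "source": "llm_review"})
    return results
```

In the paper, the samples routed to the LLM are also reused as a fine-tuning set for the SLMs under a continual learning schedule, which this sketch omits.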


Datasheets Aren't Enough: DataRubrics for Automated Quality Metrics and Accountability

Winata, Genta Indra, Anugraha, David, Liu, Emmy, Aji, Alham Fikri, Hung, Shou-Yi, Parashar, Aditya, Irawan, Patrick Amadeus, Zhang, Ruochen, Yong, Zheng-Xin, Cruz, Jan Christian Blaise, Muennighoff, Niklas, Kim, Seungone, Zhao, Hanyang, Kar, Sudipta, Suryoraharjo, Kezia Erina, Adilazuarda, M. Farid, Lee, En-Shiun Annie, Purwarianti, Ayu, Wijaya, Derry Tanti, Choudhury, Monojit

arXiv.org Artificial Intelligence

High-quality datasets are fundamental to training and evaluating machine learning models, yet their creation, especially with accurate human annotations, remains a significant challenge. Many dataset paper submissions lack originality, diversity, or rigorous quality control, and these shortcomings are often overlooked during peer review. Submissions also frequently omit essential details about dataset construction and properties. While existing tools such as datasheets aim to promote transparency, they are largely descriptive and do not provide standardized, measurable methods for evaluating data quality. Similarly, metadata requirements at conferences promote accountability but are inconsistently enforced. To address these limitations, this position paper advocates for the integration of systematic, rubric-based evaluation metrics into the dataset review process, particularly as submission volumes continue to grow. We also explore scalable, cost-effective methods for synthetic data generation, including dedicated tools and LLM-as-a-judge approaches, to support more efficient evaluation. As a call to action, we introduce DataRubrics, a structured framework for assessing the quality of both human- and model-generated datasets. Leveraging recent advances in LLM-based evaluation, DataRubrics offers a reproducible, scalable, and actionable solution for dataset quality assessment, enabling both authors and reviewers to uphold higher standards in data-centric research. We also release code to support reproducibility of LLM-based evaluations at https://github.com/datarubrics/datarubrics.
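DataRubrics itself defines the rubric; the snippet below only illustrates, under assumed placeholder dimensions, how a rubric-based review could be represented so that either a human reviewer or an LLM judge fills in the scores:

```python
# Hypothetical rubric layout for dataset review; the dimensions below are
# illustrative placeholders, not the criteria defined by DataRubrics.
from dataclasses import dataclass

@dataclass
class RubricItem:
    dimension: str            # e.g. originality, documentation, quality control
    description: str          # what the reviewer (human or LLM judge) should check
    score: int | None = None  # filled in on a fixed scale, e.g. 1-5

def overall(items: list[RubricItem]) -> float:
    scored = [i.score for i in items if i.score is not None]
    return sum(scored) / len(scored) if scored else float("nan")

rubric = [
    RubricItem("originality", "Does the dataset add something new over prior corpora?"),
    RubricItem("annotation quality", "Is inter-annotator agreement reported and adequate?"),
    RubricItem("documentation", "Are collection and licensing details disclosed?"),
]
```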


Incentivizing High-Quality Human Annotations with Golden Questions

Liu, Shang, Cai, Zhongze, Wang, Hanzhao, Ma, Zhongyao, Li, Xiaocheng

arXiv.org Machine Learning

Human-annotated data plays a vital role in training large language models (LLMs), such as supervised fine-tuning and human preference alignment. However, it is not guaranteed that paid human annotators produce high-quality data. In this paper, we study how to incentivize human annotators to do so. We start from a principal-agent model of the dynamics between the company (the principal) and the annotator (the agent), where the principal can only monitor the annotation quality by examining $n$ samples. We investigate the maximum likelihood estimators (MLE) and the corresponding hypothesis testing to incentivize annotators: the agent is given a bonus if the MLE passes the test. By analyzing the variance of the outcome, we show that the strategic behavior of the agent makes the hypothesis testing very different from traditional ones: unlike the exponential rate proved by large deviation theory, the principal-agent model's hypothesis testing rate is of $\Theta(1/\sqrt{n \log n})$. Our theory implies two criteria for the golden questions used to monitor the performance of the annotators: they should be of (1) high certainty and (2) similar format to normal ones. In that light, we select a set of golden questions in human preference data. Through incentive-compatible experiments, we find that the annotators' behavior is better revealed by those golden questions than by traditional survey techniques such as instructed manipulation checks.
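As a toy illustration of the monitoring step, the sketch below checks an annotator's agreement on $n$ golden questions against a threshold whose slack shrinks roughly like $1/\sqrt{n \log n}$; the `base_quality` parameter and the threshold form are assumptions for illustration, not the paper's exact MLE-based test:

```python
import math

def passes_golden_check(answers: list[str], gold: list[str],
                        base_quality: float = 0.9) -> bool:
    """Toy bonus decision on n golden questions. The slack term only mirrors
    the flavor of the Theta(1/sqrt(n log n)) rate from the paper; the actual
    test is based on maximum likelihood estimates."""
    n = len(gold)
    if n < 2:
        return False  # too few golden questions to test anything
    agreement = sum(a == g for a, g in zip(answers, gold)) / n
    slack = 1.0 / math.sqrt(n * math.log(n))
    return agreement >= base_quality - slack
```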


LLMs as Data Annotators: How Close Are We to Human Performance

Haq, Muhammad Uzair Ul, Rigoni, Davide, Sperduti, Alessandro

arXiv.org Artificial Intelligence

In NLP, fine-tuning LLMs is effective for various applications but requires high-quality annotated data. However, manual annotation of data is labor-intensive, time-consuming, and costly. Therefore, LLMs are increasingly used to automate the process, often employing in-context learning (ICL), in which some examples related to the task are given in the prompt for better performance. However, manually selecting context examples can lead to inefficiencies and suboptimal model performance. This paper presents comprehensive experiments comparing several LLMs, considering different embedding models, across various datasets for the Named Entity Recognition (NER) task. The evaluation encompasses models with approximately 7B and 70B parameters, including both proprietary and non-proprietary models. Furthermore, leveraging the success of Retrieval-Augmented Generation (RAG), it also considers a method that addresses the limitations of ICL by automatically retrieving contextual examples, thereby enhancing performance. The results highlight the importance of selecting the appropriate LLM and embedding model, understanding the trade-offs between LLM sizes and desired performance, and the necessity to direct research efforts towards more challenging datasets.
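A minimal sketch of the retrieval idea, assuming a hypothetical `embed` callable standing in for an embedding model: embed the unlabeled sentence, pick the k most similar labeled examples, and place them in the prompt as in-context demonstrations.

```python
# Sketch of retrieval-augmented example selection for in-context NER
# annotation. `embed` is a hypothetical callable; cosine similarity is
# computed with the standard library only.
import math
from typing import Callable, Sequence

def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def build_prompt(query: str,
                 pool: list[tuple[str, str]],          # (sentence, NER annotation)
                 embed: Callable[[str], list[float]],
                 k: int = 4) -> str:
    q_vec = embed(query)
    ranked = sorted(pool, key=lambda ex: cosine(embed(ex[0]), q_vec), reverse=True)
    demos = "\n\n".join(f"Sentence: {s}\nEntities: {y}" for s, y in ranked[:k])
    return f"{demos}\n\nSentence: {query}\nEntities:"
```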


How to Enable Effective Cooperation Between Humans and NLP Models: A Survey of Principles, Formalizations, and Beyond

Huang, Chen, Deng, Yang, Lei, Wenqiang, Lv, Jiancheng, Chua, Tat-Seng, Huang, Jimmy Xiangji

arXiv.org Artificial Intelligence

Advancements in NLP research have been greatly propelled by large language models (LLMs), which have showcased exceptional abilities (Zhao et al., 2023; Laskar et al., 2024). These advancements are paving the way for the development of AI models that can behave as autonomous agents, working alongside humans to tackle intricate tasks. These models, for example, can cooperate with humans on data annotation (Klie et al., 2020; Li et al., 2023a; Huang et al., 2024c), information seeking (Deng et al., 2023a; Wang et al., 2023b; Zhang et al., 2024d), creative writing (Padmakumar and He, 2022; Akoury et al., 2020) and real-world problem solving (Mehta et al., 2023; Feng et al., 2024; Qian et al., 2024). Given all these elements, the information on particular details about how to formalize an effective human-model cooperation to achieve collective outputs is rather under-specified and scattered. Therefore, a comprehensive and systematic analysis of the underlying principles and formalizations of human-model cooperation is still absent. This gap in understanding presents a significant opportunity for advancement, enabling us to develop a deeper understanding of the fundamental basics that govern the effective cooperation between humans and intelligent models. To fill this gap, in this survey, we take the first step to summarize the principles, formalizations, ...


Hands-On Tutorial: Labeling with LLM and Human-in-the-Loop

Artemova, Ekaterina, Tsvigun, Akim, Schlechtweg, Dominik, Fedorova, Natalia, Tilga, Sergei, Chernyshev, Konstantin, Obmoroshev, Boris

arXiv.org Artificial Intelligence

Training and deploying machine learning models relies on a large amount of human-annotated data. As human labeling becomes increasingly expensive and time-consuming, recent research has developed multiple strategies to speed up annotation and reduce costs and human workload: generating synthetic training data, active learning, and hybrid labeling. This tutorial is oriented toward practical applications: we will present the basics of each strategy, highlight their benefits and limitations, and discuss in detail real-life case studies. Additionally, we will walk through best practices for managing human annotators and controlling the quality of the final dataset. The tutorial includes a hands-on workshop, where attendees will be guided in implementing a hybrid annotation setup. This tutorial is designed for NLP practitioners from both research and industry backgrounds who are involved in or interested in optimizing data labeling projects.
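As a minimal sketch of the hybrid labeling setup discussed in the tutorial, assuming a hypothetical `model_predict` that returns a label and a confidence score, confident items are auto-labeled and the rest are queued for human annotators:

```python
# Minimal hybrid-labeling split; `model_predict` and the threshold value are
# illustrative assumptions, not part of the tutorial's materials.
from typing import Callable

def split_for_hybrid_labeling(items: list[str],
                              model_predict: Callable[[str], tuple[str, float]],
                              threshold: float = 0.9):
    auto_labeled, human_queue = [], []
    for item in items:
        label, confidence = model_predict(item)
        if confidence >= threshold:
            auto_labeled.append((item, label))
        else:
            human_queue.append(item)   # least confident items go to annotators
    return auto_labeled, human_queue
```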


On Limitations of LLM as Annotator for Low Resource Languages

Jadhav, Suramya, Shanbhag, Abhay, Thakurdesai, Amogh, Sinare, Ridhima, Joshi, Raviraj

arXiv.org Artificial Intelligence

Low-resource languages face significant challenges due to the lack of sufficient linguistic data, resources, and tools for tasks such as supervised learning, annotation, and classification. This shortage hinders the development of accurate models and datasets, making it difficult to perform critical NLP tasks like sentiment analysis or hate speech detection. To bridge this gap, Large Language Models (LLMs) present an opportunity as potential annotators, capable of generating datasets and resources for these underrepresented languages. In this paper, we focus on Marathi, a low-resource language, and evaluate the performance of both closed-source and open-source LLMs as annotators. We assess models such as GPT-4o, Gemini 1.0 Pro, Gemma 2 (2B and 9B), and Llama 3.1 (8B) on classification tasks including sentiment analysis, news classification, and hate speech detection. Our findings reveal that while LLMs excel in annotation tasks for high-resource languages like English, they still fall short when applied to Marathi. Even advanced closed models like Gemini and GPT underperform in comparison to BERT-based baselines, highlighting the limitations of LLMs as annotators for low-resource languages.


LLM Chain Ensembles for Scalable and Accurate Data Annotation

Farr, David, Manzonelli, Nico, Cruickshank, Iain, Starbird, Kate, West, Jevin

arXiv.org Artificial Intelligence

The ability of large language models (LLMs) to perform zero-shot classification makes them viable solutions for data annotation in rapidly evolving domains where quality labeled data is often scarce and costly to obtain. However, the large-scale deployment of LLMs can be prohibitively expensive. This paper introduces an LLM chain ensemble methodology that aligns multiple LLMs in a sequence, routing data subsets to subsequent models based on classification uncertainty. This approach leverages the strengths of individual LLMs within a broader system, allowing each model to handle data points where it exhibits the highest confidence, while forwarding more complex cases to potentially more robust models. Our results show that the chain ensemble method often exceeds the performance of the best individual model in the chain and achieves substantial cost savings, making LLM chain ensembles a practical and efficient solution for large-scale data annotation challenges.
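A minimal sketch of the routing logic, assuming each chain entry is a hypothetical `(predict, threshold)` pair where `predict` returns a label and a confidence score: every model labels the items it is confident about and forwards the rest, with the final model labeling whatever remains.

```python
# Sketch of uncertainty-based routing through a chain of models. The
# (predict, threshold) pairs are illustrative assumptions, not the paper's API.
from typing import Callable

Chain = list[tuple[Callable[[str], tuple[str, float]], float]]

def chain_ensemble(items: list[str], chain: Chain) -> dict[str, str]:
    labels: dict[str, str] = {}
    remaining = list(items)
    for i, (predict, threshold) in enumerate(chain):
        forwarded = []
        last_model = i == len(chain) - 1
        for item in remaining:
            label, confidence = predict(item)
            if confidence >= threshold or last_model:
                labels[item] = label
            else:
                forwarded.append(item)   # defer to a later (stronger) model
        remaining = forwarded
    return labels
```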


Can Vision-Language Models Replace Human Annotators: A Case Study with CelebA Dataset

Lu, Haoming, Zhong, Feifei

arXiv.org Artificial Intelligence

This study evaluates the capability of Vision-Language Models (VLMs) in image data annotation by comparing their performance on the CelebA dataset, in terms of quality and cost-effectiveness, against manual annotation. Annotations from the state-of-the-art LLaVA-NeXT model on 1000 CelebA images are in 79.5% agreement with the original human annotations. Incorporating re-annotations of disagreed cases into a majority vote boosts AI annotation consistency to 89.1%, and even higher for more objective labels. Cost assessments demonstrate that AI annotation significantly reduces expenditures compared to traditional manual methods, representing less than 1% of the costs for manual annotation in the CelebA dataset. These findings support the potential of VLMs as a viable, cost-effective alternative for specific annotation tasks, reducing both financial burden and ethical concerns associated with large-scale manual data annotation. The AI annotations and re-annotations utilized in this study are available on GitHub.
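A minimal sketch of the disagreement-resolution step, assuming a hypothetical `vlm_annotate` callable: labels that match the original annotation are kept, and disagreements are re-annotated several times and resolved by majority vote.

```python
# Sketch of the re-annotation-and-vote step for attributes where the VLM and
# the original label disagree. `vlm_annotate` is a hypothetical stand-in.
from collections import Counter
from typing import Callable

def resolve_label(image_id: str, attribute: str, original: int,
                  vlm_annotate: Callable[[str, str], int],
                  repeats: int = 3) -> int:
    first = vlm_annotate(image_id, attribute)
    if first == original:
        return first                       # agreement: keep the label
    votes = Counter(vlm_annotate(image_id, attribute) for _ in range(repeats))
    return votes.most_common(1)[0][0]      # majority of re-annotations
```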


Model-in-the-Loop (MILO): Accelerating Multimodal AI Data Annotation with LLMs

Wang, Yifan, Stevens, David, Shah, Pranay, Jiang, Wenwen, Liu, Miao, Chen, Xu, Kuo, Robert, Li, Na, Gong, Boying, Lee, Daniel, Hu, Jiabo, Zhang, Ning, Kamma, Bob

arXiv.org Artificial Intelligence

The growing demand for AI training data has transformed data annotation into a global industry, but traditional approaches relying on human annotators are often time-consuming, labor-intensive, and prone to inconsistent quality. We propose the Model-in-the-Loop (MILO) framework, which integrates AI/ML models into the annotation process. Our research introduces a collaborative paradigm that leverages the strengths of both professional human annotators and large language models (LLMs). By employing LLMs as pre-annotation and real-time assistants, and as judges of annotator responses, MILO enables effective interaction patterns between human annotators and LLMs. Three empirical studies on multimodal data annotation demonstrate MILO's efficacy in reducing handling time, improving data quality, and enhancing annotator experiences. We also introduce quality rubrics for flexible evaluation and fine-grained feedback on open-ended annotations. The MILO framework has implications for accelerating AI/ML development, reducing reliance on human annotation alone, and promoting better alignment between human and machine values.
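A minimal sketch of one MILO-style pass, with hypothetical `llm_preannotate`, `human_edit`, and `llm_judge` callables and placeholder rubric keys: the model drafts an annotation, the human corrects it, and an LLM judge scores the result.

```python
# Sketch of one model-in-the-loop annotation pass. All three callables and
# the rubric dimensions are illustrative assumptions, not the MILO API.
from typing import Callable

def milo_pass(item: str,
              llm_preannotate: Callable[[str], str],
              human_edit: Callable[[str, str], str],
              llm_judge: Callable[[str, str], dict[str, int]]) -> dict:
    draft = llm_preannotate(item)        # model drafts a label/answer
    final = human_edit(item, draft)      # annotator corrects the draft
    scores = llm_judge(item, final)      # e.g. {"completeness": 4, "clarity": 5}
    return {"item": item, "annotation": final, "rubric_scores": scores}
```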